On the use of historical estimates

The use of historical, i.e., already existing, estimates in current studies is common in a wide variety of application areas. Nevertheless, despite their routine use, the uncertainty associated with historical estimates is rarely properly accounted for in the analysis. In this communication, we review common practices and then provide a mathematical formulation and a principled frequentist methodology for addressing the problem of drawing inferences in the presence of historical estimates. Three distinct variants are investigated in detail; the corresponding limiting distributions are found and compared. The design of future studies, given historical data, is also explored and relations with a variety of other well-studied statistical problems discussed.


Introduction
There are many circumstances in which a statistical analysis either requires, or can greatly benefit, from the use of historical, that is existing, information. In this paper we focus on the situation where the historical information consists of parameter estimates. These may be essential for model fitting but impossible, and/or very expensive, to collect in the context of the current study. Although related, we will not explicitly discuss the large literature on data combination schemes or other two-stage plugin methods. An example of the former is Ridder and Moffitt (2007) comprehensive review of methodologies for data combination common in econometrics whereas an example of the latter is Genest et al. (1995) seminal paper on two-step semi-parametric inference in copula models. These and similar problems have received considerable attention and have a long history in Statistics, see, e.g., Cochran (1954).
Historical estimates are used in a variety of applications in the social, physical, and biomedical sciences. For example, some models for the spread of infectious diseases, such as the SIR model (Becker 2017) popularized in connection with COVID-19, e.g., Cooper et al. (2020), require the input of age-specific transmission parameters which can be estimated from social contact networks (Edmunds et al. 1997;Wallinga et al. 2006) and then used to fit epidemic models (Mossong et al. 2008;Goeyvaerts et al. 2010, Yaari et al. 2018. Another interesting application is the optimization of cancer treatment where Kronik et al. (2010) develop a framework for predicting the outcome of prostate cancer immunotherapy by fitting personalized mathematical models. Their model consists of a set of differential equations whose behavior is governed by a collection of parameters, some of which are global parameters while others are subject specific. The values of the global parameters were obtained from at least ten different published studies, see their Table 2, whereas the subject level parameters were estimated by fitting a model to each participant assuming that the global parameters were estimated without error. See Kogan et al. (2012) and Kozłowska et al. (2018) for similar applications. It is worth noting that the applications above may be viewed as a model for situations in which knowledge collected in one setting, experimental or observational, is then used to estimate quantities arising in a different experiment and is quite common in the biomedical sciences. See Lee and Zelen (1998) and Davidov and Zelen (2004) for a similar structure arising in the planning of early detection programs.
Another very important application in which historical estimates are used is clinical trials. Consider, for example, the situation in which the effect of a combination of treatments is assessed (e.g., Tamma et al. 2012, Kanda et al. 2016). In such cases there exists a collection of therapies which have been independently proven to be somewhat successful at treating a medical condition. The objective of a new study may then be to assess whether a combination of these therapies provides an even better outcome. In the simplest case, one may view this problem as a three armed clinical trial comparing treatments A,B and A + B in which historical estimates on the efficacy of treatments A and B already exist. An important example of such situations is the Food and Drug. Administration (2006) guidelines for submitting applications for approval of fixed dose combinations, i.e., co-packaged drug products, of previously approved antiretrovirals for the treatment of HIV. In particular, Attachment A of the aforementioned document considers the scenario in which a non-innovator, i.e., a generic drug company, wants to obtain approval for a combination of already approved ingredients. In this case, only efficacy data for the combination needs to be submitted. We will revisit and thoroughly analyze two forms of this example later on. More broadly, the use of historical data in the contexts of clinical trials has been investigated by numerous researchers using multiple perspectives, cf., Pocock (1976), Peto et al. (1976), Neuenschwander et al. (2010), Viele et al. (2014), and Piantadosi (2017) among many others. As noted by a referee, a particularly relevant class of designs are platform trials which allow adding new treatments to the experiment and thus controls may become non-concurrent, cf. Lee and Wason (2020) and Roig et al. (2022).
The use of historical estimates is also widespread in the social sciences. For example, in the fitting of some econometric models researchers may use values estimated from previously collected survey data. We point to the paper of Newey et al. (2005) which focuses on the asymptotic bias of the estimated parameters. The complexity of using historical estimates in the social sciences is further illustrated by the work of Tasseva (2019). In a microsimulation study investigating the effect of the recent expansion in higher education in Great Britain on household inequalities, previously obtained estimates from the Family Resources Survey for Great Britain (GOV.UK 2019) were used. While sampling variability could be taken into account using bootstrap methods, as noted by the author, measurement error, inevitably present in income information collected in surveys, see, e.g., Moore et al. (2000), could not be accounted for using this method. Similarly, Douidich et al. (2016) describe an imputation-relatedmethod for incorporating estimates obtained in labor force surveys (which are easily and cheaply conducted) into household expenditure surveys (which are much more time consuming and expensive) in order to estimate poverty rates in Morocco. Likewise, demographic model fitting and projections rely on historical data. The standard method of population projections (see United Nations 2014) is based on the combination of cohort survival rates, i.e., historical data, with current data on cohort sizes. Raftery et al. (2014) proposed a Bayesian approach to take the uncertainty associated with historical data into account. It is worth noting that in this case the uncertainty accounted for by Bayesian modeling did not come from observational errors but rather from the fact that the true population figures may have changed over time.
Researchers often do not adequately account for the variability of the historical estimates when incorporating them into a current analysis. In fact, the practice of plugging-in the estimated values for certain parameters is widespread. However, this practice is often not disclosed as many practitioners view this strategy as a natural way of "doing things". Consequently, the objectives and contributions of this communication are twofold: first, we draw attention to current practice, and secondly, and more importantly, we provide a principled methodology for incorporating historical estimates into a current analysis. Surprisingly, despite the ubiquity of historical data and estimates and the many papers that touch on various aspects thereof, a general methodology discussing the use of historical estimates, as given here, has been thus far lacking. In particular we consider two broad settings in which historical estimates are employed. Three such estimators are presented in Sect. 3 and one in Sect. 4.1. Their limiting distributions are found and a theoretical analysis comparing their precision is conducted. When comparable, a preference order among the different approaches is established. Our paper goes beyond the existing knowledge by providing an inventory of ways in which historical estimates can be used, and by quantifying the properties of the resulting estimators. We also demonstrate how these results may be used in the design of experiments.
The paper is organized as follows. Our notation and formulation are outlined in Sect. 2. Section 3 provides our main theoretical findings which include the limiting distributions of the estimates in the presence of historical estimates and a comparison thereof. In Sect. 4 two applications are described in conjunction with accompanying numerical experiments. The first application addresses the two-way analysis of variance (ANOVA) problem introduced in Sect. 2. The second, related application, deals with a drug interaction study within the framework of Bliss-independence (Bliss 1939), an old concept which has garnered much recent attention. We conclude with a discussion in Sect. 5. All proofs are collected in an Appendix.

Notation and formulation
Consider a designed experiment or observational study, denoted by S, in which data D consisting of n observations are collected. Usually, the observations are independent and identically distributed. Suppose further that the model describing the distribution of D is indexed by ω T = (θ T , η T ) where θ ∈ R p and η ∈ R q is the concatenation of η 1 ∈ R q 1 , . . . , η K ∈ R q K with q = q 1 + · · · + q K . Let Φ(ω) be some function of the model parameters which is of interest to the researchers. Clearly, Φ(ω) may be a function of θ alone, η alone or of both θ and η. The primary goal of the study S, which we refer to as the current study, is inference on Φ(ω) in the presence of historical data which we view as a collection of K , independent estimates η 1 , . . . , η K obtained from historical studies S 1 , . . . , S K of sizes m 1 , . . . , m K and m = m 1 + · · · + m K denotes the total sample size in the historical studies.
In some circumstances, it may not be possible to estimate ω using the data D. However, if η were known in advance then it would be possible to estimate θ . As an example, such a situation would arise if the model f (·; θ , η) is not identifiable whereas the model f (·; θ , η 0 ) is identifiable for every fixed value of η 0 . In other circumstances given the data D both θ and η are estimable (e.g., Peddada et al. 2007). Thus, in this communication we consider two distinct settings, the second of which has two variants. In the first setting, referred to as a Type I Problem, only the parameter θ is estimable using the data D, while (η 1 , . . . , η K ) are fixed at their historical estimated values ( η 1 , . . . , η K ). In the second setting, referred to as Type II Problem, both θ and η are estimable using D and a two-step procedure is utilized to estimate θ while updating the estimates for (η 1 , . . . , η K ). It may also happen that the available data corresponds to a Type II problem and while a Type II analysis would be possible, the researcher may decide to conduct a Type I Analysis, i.e., estimate θ as if the data came from a Type I Problem. One of our results shows that this is an inferior strategy, i.e., if the data D identifies ω it is always advisable to re-estimate η and the loss of precision is quantified in terms of a simple decomposition of the variance matrices of the resulting estimators. It is also important to emphasize that there are situations in which the investigator, by means of the design of the study S, may control whether the problem is of Type I or a Type II.
To fix ideas consider the two-way ANOVA model in which the expected value of an outcome Y is given by where for i = 1, 2, T i ∈ {0, 1} indicates whether treatment i is administered. Here η 0 denotes the mean of Y when neither treatment is administered, η i models the marginal increase in the expectation of Y when treatment i is administered and θ models the interaction T 1 × T 2 . Suppose, now that the historical data consists of two studies S 1 and S 2 of sizes m 1 and m 2 , respectively, where in the study S i treatment i was compared with a control. Clearly the historical data provides no information on θ . Thus inference on θ would require a new study S in which T 1 = T 2 = 1 for some subset of the observations. For simplicity, interchangeability is assumed, i.e., all experimental units, in S 1 and S 2 as well as S, are assumed to be drawn from the same population, e.g., Peddada et al. (2007), and therefore any change in the mean response may be attributed solely to the treatment combination received. The assumption of interchangeability may be relaxed as discussed in Sect. 5. One objective of this communication is to provide a methodology for effective design and analysis of a new study S of size n which allows the estimation of θ and utilizes the historical estimates of (η 0 , η 1 , η 2 ) obtained from S 1 and S 2 . Depending on its objectives, the study S may be of various forms. For example, one may choose to allocate all n observations to receive both treatments, i.e., T 1 = T 2 = 1. In this case, the data D is an IID sample of observations with mean η 0 + η 1 + η 2 + θ and variance σ 2 . Although the parameter θ is not identifiable from D alone it is estimable given the historical data, so this is clearly a Type I Problem. Alternatively, if S allocates observation to all treatment combinations then θ as well as (η 0 , η 1 , η 2 ) are estimable from S and this falls within the framework of a Type II Problem. This example will be further analyzed in Sect. 4.1.

Results
Our main theoretical findings, i.e., Theorems 3.1, 3.2, and 3.4 describe the limiting distributions of estimators for ω which are then compared in Theorems 3.3, 3.5 and 3.6. Remark 7 provides a brief summary of the results of this Section.

Type I problems
Suppose first that we are in the setting of a Type I Problem. Recall that in such circumstances only θ is estimated while (η 1 , . . . , η K ) are fixed at their historical values. Thus, letθ A solve where η = ( η T 1 , . . . , η T K ) T . The estimating Eq. (2) may be a score equation motivated by likelihood theory, a generalized estimating equation derived by quasi-likelihood or any other statistical estimation framework. Observe that the solutionθ A of (2) is obtained by plugging-in the sample values of the K independent estimators η 1 , . . . , η K . For simplicity we may further assume that the data D is a random sample Y 1 , . . . , Y n and (2) is of the form The function ψ is assumed to be: (i) continuously differentiable with respect to both θ and η 1 , . . . , η K ; it is further assumed to satisfy (ii) E 0 (ψ) = 0; (iii) E 0 (ψψ T ) < ∞; (iv) the matrix E 0 (∂ψ/∂η) exists; and (v) the matrix E 0 (∂ψ/∂θ) exists and is invertible. Here E 0 (·) denotes the expectation taken at ω 0 = (θ 0 , η 0 ) = (θ 0 , η 1,0 , . . . , η K ,0 ), the true value of all parameters. Conditions (i) − (v) are all stan-dard regularity conditions often imposed in the literature (cf., Heyde 2008, Van der Vaart 2000. We now have the following: Theorem 3.1 Letθ A be a solution to (2) and setη A = η. Assume that: (i)θ A is consistent at ω 0 ; (ii) the estimating function ψ satisfies the regularity conditions listed above; and (iii) the historical estimates satisfy are independent of each other and of the current study. Then if (m/m j ) → κ j < ∞ for all j = 1, . . . , K as m j → ∞ and n/m → ρ ∈ (0, ∞) as n → ∞ we have Remark 1 Clearly, A θθ is the p × p asymptotic variance matrix ofθ A , A ηη is the q ×q asymptotic variance matrix ofη A and A θη = A T ηθ is their p ×q asymptotic covariance matrix.
Remark 2 As pointed out by a referee the application of Theorem 3.1 is predicated on the fact that the historical studies report the estimates of the variance matrices Σ 1 , . . . , Σ K .
The proof of Theorem 3.1 is a straightforward, but somewhat involved, application of the delta method. In contrast with Randles (1982) and Pierce (1982) which describe the limiting distribution of statistics that are explicit functions of estimated parameters, the estimatorθ A is an implicit function ofη A . For a related but less general result see Benichou and Gail (1989). Further note that ( D −1 θ 0 )Σ ψ ( D −1 θ 0 ) T is the asymptotic variance ofθ A when the true values of η 1 , . . . , η K are known in advance. Thus the term ρ D η 0 Σ D T η 0 may be viewed as the penalty for substituting estimates for the true values of the parameters. The penalty may also be rewritten as ρ K j=1 κ j D j Σ j D T j where D j = E 0 (∂ψ/∂η j ) which expresses its dependence on the relative sample sizes, the asymptotic variances of the historical estimators and the sensitivity of the estimation procedure with respect to the historical estimates, embodied in the matrices D 1 , . . . , D K .

Remark 3
Note that if ρ is very small which occurs when m n, then the penalty is inconsequential, i.e., the asymptotic variance ofθ A is close to its variance when η 1 , . . . , η K are fully known.

Type II problems
Next, consider the case where both θ and η are estimable using the data D observed in the current study S. Further assume that two estimating functions Ψ and Γ are available to us; Ψ is an estimating function for θ given a known fixed value of η, as in Type I Problems, whereas Γ is an estimating function for η given a known fixed value of θ . For example, within the likelihood framework Ψ is the score with respect to θ while Γ is the score with respect to η.
We propose estimating ω using a two step procedure. In the first step the data D is used to obtain the pair (θ ,η) T which simultaneously solve Under standard regularity conditions, cf., the conditions listed just before the statement of Theorem 3.1, the estimators (θ ,η) T satisfy where Υ is assumed to be a non-singular variance matrix which can be consistently estimated from the data by, sayΥ , the standard sandwich estimator (Van der Vaart 2000). For convenience, we may partition Υ as where Υ θθ and Υ ηη denote the marginal asymptotic variances ofθ andη, respectively, and Υ θη is their asymptotic covariance. Naturally, a similar partition holds forΥ . Furthermore, as in Sect. 3.1, at our disposal are K independent historical estimates of η 1 , . . . , η K obtained using studies of sizes m 1 , . . . , where, again, it is assumed that Σ j are non-singular and can be consistently estimated for all j = 1, . . . , K . Thus where Σ is given in the statement of Theorem 3.1. Let Σ be a consistent estimator of Σ.
The historic and current estimates of η can be aggregated, or combined, in many ways. Lemma 2, appearing in the Appendix, suggests using the estimator which is the MLE under normality assuming that the matrices Υ ηη and Σ are known. Note thatη where the weights W 1 and W 2 are the symmetric matrices which satisfy I = W 1 + W 2 with γ = lim(n/(n + m)). Thus (7) differs from the best linear unbiased estimator by at most an o p (1) term.
In the second step we findθ B by solving whereη is given by (7). We now have: Theorem 3.2 Letθ B be a solution to (9) whereη B =η is given in (7). Assume that the regularity conditions of Theorem 3.1 hold.
Although the mechanics are slightly more involved, the proof of Theorem 3.2 builds on the proof of Theorem 3.1. Moreover, the structures of the asymptotic variance matrices A and B are analogous with the exception that the variance matrix ρΣ appearing in A is replaced with
It is clear that whenever the model for D identifies ω, both (θ B ,η B ) and (θ A ,η A ) can be computed. Next, using the concept of the Loewner order we show the former is superior to the latter. Recall that the matrix V 1 is said to be smaller in the Loewner order compared with the matrix V 2 if V 2 − V 1 is non-negative definite (Pukelsheim 2006). This relationship is denoted by V 1 V 2 . Suppose now that V 1 and V 2 are the variances of two (asymptotically) unbiased estimators. Then V 1 V 2 implies that the estimator associated with V 1 is more efficient than the estimator associated with V 2 . This means, for example, that the confidence ellipsoid associated with V 1 lies within the confidence ellipsoid associated with V 2 .

Moreover, for any function
Theorem 3.3 indicates that, if possible, it is always asymptotically beneficial to estimate both θ and η using the data D collected in the study S. Moreover, Theorem 3.3 holds also when only a sub-vector of η is identified by the data D.
Remark 5 As noted by a referee, an alternative approach to Type II Problems would be to combine the first and second estimation steps. This can be done by simultaneously solving the estimating equations The system (12) is obtained from (3) by augmenting the estimating function Γ with the term m Σ −1 ( η − η). The latter is a pseudo-score equation which follows directly from (6). By appropriately modifying the proof of Theorem 3.2 it can be shown that (θ B ,η B ) have the same limiting distribution as the solution of (12). It thus follows that the estimator (7) is asymptotically efficient up to the first order.
Another variant of Type II Problems occurs when the data D is not available, but nevertheless the estimates (θ,η) from the current study as well as their estimated variance, i.e.,Υ is given. The objective is then to combine the current estimators (4) with the historical estimators (6). To this end we propose estimating θ bȳ whereη C =η is given by (7). The estimators (7) as well as (13) are motivated by Lemma 2 and Remark 8 appearing in the Appendix.
Theorem 3.4 Let (θ C ,η C ) T be defined by (7) and (13). Suppose further that (4) and (6) hold and both Υ and Σ can be consistently estimated. Then as n → ∞ we have The matrices W 1 and W 2 are defined in (8) and R = Υ θη Υ −1 ηη . Moreover, we have: Theorem 3.4 describes the large sample behavior of the estimators (7) and (13). Further insight is facilitated by considering the simplest possible situation, i.e., when It is not hard to see that (13) Furthermore C θθ simplifies to where r = υ θη /(υ θθ υ ηη ) is the asymptotic correlation betweenθ andη and is the limiting value of w * 2 as n/(n + m) → γ . It follows that the asymptotic relative efficiency ofθ toθ is 1 − w 2 r 2 , which is at most unity (when υ θη = 0) and no less than 1 − r 2 (when γ is close to 0). Clearly, the historical estimates are useful only if the covariance υ θη is non-zero and highly useful whenever w 2 is close to unity. A similar but more involved analysis applies when the parameters are multidimensional.
We emphasize that the structure of the estimatorsη C andθ C as well as the form of C are related to, but much more general, than results obtained in the literature on both double sampling and monotone missing normal data (Anderson 1957;Morrison 1971;Kanda and Fujikoshi 1998). Double sampling is a widely used technique in survey sampling, where the estimator is also known as the generalized regression estimator (Thompson 1997), as well as in other applications, cf. Davidov and Haitovsky (2000), Chen and Chen (2000) and the references therein. We also note that Eq. (15) is a generalization of the formulas obtained for the usual double sampling estimator (e.g., Tamhane 1978) where w 2 = m/(n + m). The following Theorem substantially generalizes on results obtained in the literature on both the double sampling and monotone missing data.
In words, the estimator (θ C ,η C ), incorporating the historical estimates and derived by combining (θ,η) and η, is more precise than (θ,η), the estimator based only on the current study.
Remark 6 It is also important to emphasize that in finite, typically small samples, the estimator C θθ may be in fact inferior to Υ θθ . This typically occurs when the "regression matrix" R, see the statement of Theorem 3.4, is poorly estimated. This feature has been also recognized in the double sampling literature (Tamhane 1978).
A little algebra shows that so we can remove the dependence of C on the matrices W 1 and W 2 . Clearly, whenever the data D is available both (θ B ,η B ) and (θ C ,η C ) can be calculated whereη B =η C are given in (7).
Similarly, we can viewθ C as a solution to an esti- The form of Λ can be easily deduced from Lemma 2 and that of λ by plugging in the influence functions forθ andη into Λ. In fact, the precise form of the influence function ofθ C is readily derived, for more details see Remark 9 appearing in the Appendix. It is worth noting that Ψ operates on the full data D whereas Λ operates on functions thereof namely the estimators (θ ,η) and η. Thus (θ C ,η C ) can be viewed as functions of a coarsening of the data D and therefore is expected to be less efficient than (θ B ,η B ). This indeed is the case under mild regularity conditions. A formal statement requires the introduction of some additional notation. Let h = h(θ , η, Y ) denote any estimating function and denote D θ 0 (h) = E 0 (∂ h/∂θ) and D η 0 (h) = E 0 (∂ h/∂θ ). Note that earlier we referred to D θ 0 (ψ) and D η 0 (ψ) simply as D θ 0 and D η 0 . Now: Theorem 3.6 Suppose that bothω B andω C can be obtained. If component-wise and in the Loewner order, then Moreover, for any function Condition (18) holds when the estimating equation Ψ (θ , η 0 ) = 0 results in more efficient estimators for θ than those resulting from Λ(θ , η 0 ) = 0 when η = η 0 is set to its true value. This condition holds for any sensible choice of Ψ . In particular it holds for the score equations associated with maximum likelihood estimation. Condition (17) roughly means that Ψ is less sensitive to small perturbations in both θ and η compared with Λ. Conditions (17) are (18) are not necessary. For example, the conclusion of Theorem 3.6 may hold if ψ is more sensitive to small perturbations but at the same time much more efficient. We believe that the aforementioned conditions hold broadly and the estimators (θ C ,η C ), described in Theorem 3.4, are generally less efficient than (θ B ,η B ), described in Theorem 3.2. For an additional discussion see Remark 9 in the Appendix. There are, however, situations in which B = C and situations whereω B =ω C for any data D. As we shall see in the next section this is the case in normal linear models in which the estimators (θ ,η) and η are actually sufficient statistics. Finally, it is worth noting that if Υ θη = 0 then the estimator (θ C ,η C ) does not improveθ whereas there is always an improvement when the full data D is available.

Remark 7
To summarize, the proposed estimators are designed to extract as much information as possible from the data. Recall that in Type I Problems the parameter η is not estimable given the current data D. The pair (θ A ,η A ) solves the system of equations Ψ (θ A ,η A ) = 0 withη A = η where η is the historical estimator. In Type II problems the pair (θ B ,η B ) solves the system of equations Ψ (θ B ,η B ) = 0 whereη B , given by (7), is a weighted combination of η andη. Moreover, as noted earlier it can be shown that the resulting estimators are asymptotically efficient. Our approach to Type I and the first variant of Type II problems are of similar structure: plug into Ψ the best available estimator of η. In the second variant of Type II Problems the pair (θ C ,η C ) solves a different system of estimating equations which we denote by Λ(θ C ,η C ) = 0. These equations are the likelihood equations given in Lemma 2 assuming that the variance matrices Σ and Υ are fully known. Since (θ C ,η C ) are based on further coarsening of the data they are generally less efficient.

Illustrations, applications and numerical results
In this section two applications are discussed in detail. In Sect. 4.1 the two-way ANOVA problem introduced in Sect. 2 is investigated. In particular, various design options for the current study S are evaluated. It is worth noting that although the abovementioned ANOVA problem is among the simplest possible, its analysis is far from trivial. Next, in Sect. 4.2 we discuss the use of historical estimates in the design of drug interaction studies in the context of Bliss independence. A simple algorithm for the design of such studies is proposed.

Two way ANOVA
Recall the ANOVA model of Sect. 2 where the studies S 1 and S 2 were designed to estimate η 1 = (η 0 , η 1 ) T and η 2 = (η 0 , η 2 ) T , respectively. Note that the parameter η 0 is estimated in both studies so η 1 and η 2 are not distinct. Therefore employing any of the aforementioned findings requires the aggregation of the historical estimates as if they came from a single experiment. The historical studies result in the estimates ( η 0 (S 1 ), η 1 (S 1 )) and ( η 0 (S 2 ), η 2 (S 2 )) as well as their standard errors. Given these we can easily back calculate the unobserved means and sample sizes in the studies S 1 and S 2 and estimate (η 0 , η 1 , η 2 ) by: where the quantityȲ j (S i ) is the average response on treatment j ∈ {0, 1, 2} in study i ∈ {1, 2} and m i, j is the size of of treatment group j in study i. Under the usual conditions for some matrix Σ. Furthermore, if (1) is a homoscedastic model with variance σ 2 and m 1,0 = m 1,1 = m 2,0 = m 2,2 , i.e., the studies S 1 and S 2 are balanced and of the same size, then it is easy to see that We will now investigate various designs for a new study S. When the primary focus of S is inference on θ then it may be advantageous in some circumstances to allocate all n observations to the treatment arm receiving both treatments one and two, i.e., T 1 = T 2 = 1 for all observations. This is clearly a Type I problem since ω is not identifiable from D but given η the parameter θ is estimable. Note that an unbiased estimator for θ is and it is not hard to see that (21) solves (2) when ψ(θ, η 0 , η 1 , η 2 , Thus, Σ ψ = σ 2 , D θ 0 = 1 and D η 0 = −(1, 1, 1) and it follows that A θθ , the asymptotic variance of (21) as described in Theorem 3.1, reduces to The second term appearing in the parentheses in the above display is an inflation factor, i.e., the price to pay for substituting estimates for the unknown value of (η 0 , η 1 , η 2 ). Note that when n/m → 0 as both m → ∞ and n → ∞ the asymptotic variance ofθ A approaches σ 2 . In practice this requires a large current study and even larger historical data. Incidentally, sinceθ A is a linear function ofȲ 12 (S) and ( η 0 , η 1 , η 2 ) it is not hard to see that its exact variance is σ 2 (1/n + 10/m) which coincides with the asymptotic form.
Alternatively, suppose that the study S allocates n/4 observations to all treatment combinations. In this case the data D identifies ω = (θ, η 0 , η 1 , η 2 ) T , so this is a Type II problem. The usual estimators for this design areη 0 =Ȳ 0 (S), and thus the limiting variance of (θ,η) T is Next we aggregate the historical and current estimates for η. As in Sect. 3 we estimate η byη = W 1η + W 2 η where Note that the weight matrices are functions of the variances Υ ηη and Σ as well as the ratio n/(n + m). Since D is fully available to us then we can estimate θ bȳ Note that the estimators (21) and (22) are of the same functional form. Further note that the statisticȲ 12 (S) in (22) is a function of the n 12 observations Y 1 , . . . , Y n 12 receiving the treatment combination T 1 = T 2 = 1. A straightforward calculation shows that B θθ is given by  where ξ 11 is the fraction of the observations which are assigned to receive both treatments. In situations where the full data is not available to us but (θ,η 0 ,η 1 ,η 2 ) are known we may estimate θ byθ C =θ − Υ θη Υ −1 ηη (η −η). It can be verified that in this application, in which a normal linear model is involved and all estimators are functions of sufficient statistics, the estimatorsθ B andθ C coincide. Thereforeθ C is not discussed any further. Table 1 provides a comparison of the asymptotic variances of (21) and (22) for a range of values of m and n. Table 1 displays asymptotic variances; the variances themselves are found by dividing any entry in the table by the size of the current study in the relevant row. Observe that both A θθ and B θθ decrease as a function of m for any fixed value of n and increase in n for any fixed m. For example when n = m = 100 then A θθ = 11 and B θθ = 9.3 whereas when m = 100 and n = 5000 then A θθ = 501 and B θθ = 15.69 and when m = 5000 and n = 100 then A θθ = 1.2 and B θθ = 4.2. Thus going down the first column of Table 1 the asymptotic variance A θθ increases by a factor of approximately 45 whereas that of B θθ by the much more modest 1.4. Similarly going across the first row the asymptotic variances of A θθ and B θθ are reduced by a factor of 9.2 and 2.2 respectively. Each pair (n, m) provides a direct comparison between the two proposed designs (design A, say, in which all experimental units in the current study receive both treatments and design B, say, which is a balanced design). Clearly, design A seems preferable in situations where m is much larger that n, otherwise design B is to be preferred.
We now look a bit deeper into the question of optimal design. Suppose for simplicity that the historical sample satisfies m 1,0 = m 1,1 = m 2,0 = m 2,2 . Note that Table 1 considers only designs with ξ = (0, 0, 0, 1) and ξ = (1/4, 1/4, 1/4, 1/4). Therefore we next consider designs for the study S where ξ = (ξ 00 , ξ 10 , ξ 01 , ξ 11 ) is any value in the unit simplex. Clearly here ξ i j denotes the proportion of observations who received treatment combination i × j where i and j are in {0, 1}. It is not hard to see that the optimal design in the interior of the simplex, i.e. for a Type II Problem, is attained by Theorem 3.2 when ξ −1 11 + 1 T (Υ −1 ηη + (ρΣ) −1 ) −1 1 is minimized where Symmetry considerations imply that under optimality ξ 01 = ξ 10 and since ξ 00 = 1 − 2ξ 10 − ξ 11 the minimization involves only a two dimensional search. Table 2 provides the optimal design, i.e., the vector ξ for estimating θ for various values of the ratio ρ = n/m found by a grid search with step size 0.001 and the restriction that ξ 00 ≥ 0.02. This restriction is necessary; otherwise the matrix Υ can not be inverted. Table 2 provides the optimal design for estimating θ as a function of the sampling ratio. Note that for large ρ, i.e., when n is larger than m, we find that ξ 01 = ξ 10 = 1/4 and that the difference between ξ 11 and ξ 00 decreases in ρ. We believe that the balanced design is optimal when ρ → ∞. It is also clear that for large ρ the designs appearing in Table 2 are generally superior to those in Table 1. For example when ρ = 1 we find that the asymptotic variances in Table 1 are 11.0 and 9.3 whereas the corresponding optimal asymptotic variance given in Table 2 is 8.0. However, for values of ρ smaller than a 1/4, i.e., when n is rela.tively small to m, then the optimal design sets ξ 00 = 0.02, and ξ 01 = ξ 10 = 0.001 which are the smallest possible values allowed by our search algorithm. This suggest that further, at most minor, improvements are possible by setting ξ 00 = 0 and/or ξ 01 = ξ 10 = 0. Clearly when ξ 01 = ξ 10 = ξ 00 = 0 we have a Type I Problem.
Therefore we next consider the situation that ξ 00 = 0 and ξ 01 = ξ 10 > 0, in which case the current study comprises of three arms and thus three group means:Ȳ 1 (S), Y 2 (S) andȲ 12 (S). We emphasize that this estimation problem is neither a Type I nor a Type II problem. Further observe that with these data alone we can not estimate ω. Nevertheless, the pair (Ȳ 1 (S),Ȳ 2 (S)) T whose mean is (η 0 + η 1 , η 0 + η 2 ) can be aggregated with η the historical estimate of η. By an appropriate modification of Lemma 2 it can be shown that η can be estimated by where S = (Ȳ 1 (S),Ȳ 2 (S)) T , V = σ 2 diag(ξ −1 01 , ξ −1 10 ) is its asymptotic variance and A = 1 1 0 1 0 1 .
is the matrix which satisfies E(S) = Aη. Note that (23) is of the same form as (7) but with A T V −1 A instead of Υ ηη . Now, letθ D denote the solution to Ψ (θ, η † ) = 0 which is nothing butθ A straightforward calculation shows that the asymptotic variance ofθ D is given by The formula above is useful in finding the optimal design for small values of ρ when ξ 00 = 0. For example when ρ = 1/8 then the design ξ = (0, 0.0005, 0.0005, 0.9990) results in a variance of 2.25 (actually 2.250751) which is slightly smaller than 2.27 the reported variance in the first row of Table 2. Finally we note that when ρ = 1/8 then A θθ equals (precisely) 2.25 which means that in this application a design for Type I Problem would be the most effective. As noted by a referee, in addition to the above mentioned three arm trial one could choose various two arm designs for S. For example, one can choose a design for which ξ 00 > 0 , ξ 11 > 0 and ξ 01 = ξ 10 = 0. Or alternatively designs for which ξ 01 > 0, ξ 11 > 0 and ξ 00 = ξ 10 = 0 . It can be shown, however, that for small values of ρ, where such designs are of interest, these two armed designs are not superior to a Type I design.

Using historical estimates in drug interaction studies
This subsection deals with the optimal design of drug interaction studies. Consider two drugs D 1 and D 2 with no-effect probabilities η 1 and η 2 , respectively and let θ denote the no-effect probability when both drugs are administered together. The drugs are called Bliss independent, see, Bliss (1939), Liu et al. (2018) If (25) does not hold and θ < η 1 η 2 there is synergy among the drugs, otherwise there is antagonism. The concept of Bliss independence has seen a recent resurgence of interest as the need to assess the benefit of combination therapies and drug-drug interactions has increased. Some current references are Pallmann and Schaarscmidt (2016), Palmer and Sorger (2017), Russ and Kishony (2018) and Niu et al. (2019). Drug interaction studies are often carried out as single-dose experiments, e.g., Ansari et al. (2008), where the interaction is assessed by considering a single dose of each of the two drugs. A more elaborate design, which we will not consider here, assesses multiple drugs and doses using response surface methodology as in Lee (2010).

Naturally, the quantity of interest in drug interaction studies is
The formulation in (26) links the problem discussed here to the ANOVA setup considered earlier. In many applications of single dose interaction tests, whether using historical data or not, an explicit or implicit asymptotic argument is used, and the theoretical results for the asymptotic case presented above are relevant. For example, Demidenko and Miller (2019) describes a Daphnia acute test with two stressors, single doses of CuSO4 and of NiCl, where the numbers of surviving organisms in water were counted after 48 hours. The observations reported were the surviving fractions of organisms only, without reporting their original numbers thus, essentially, assuming their original numbers were very high, i.e., applying an asymptotic argument. But as pointed out by Pallmann and Schaarscmidt (2016), in single-dose experiments, correct statistical analysis should rely on the observed frequencies, and not on the observed rates of success or failure. Therefore the sample sizes used in each arm of the experiment are of crucial importance and in this subsection we provide finite sample results. For simplicity suppose that there exist historical estimates of η 1 and η 1 based on independent binomial experiments with sizes m 1 and m 2 . Suppose further that the current study allows for the recruitment of n experimental units, n 1 of which will receive D 1 , n 2 will receive D 2 and n 12 will receive both drugs. Obviously n = n 1 + n 2 + n 12 (27) and θ can not be estimated unless n 12 > 0. However it is possible that n 1 = n 2 = 0. The goal is to allocate the experimental units optimally, which is equivalent to the problem of optimally allocating n + m 1 + m 2 observations in an experiment in which the single dose arms are no smaller than m 1 and m 2 , respectively. The optimal design problem can be approximated as the minimization of the large sample variance of (26) 1 n 12 subject to the constraint (27).
In contrast with the design problem encountered in Sect. 4.1 the design criterion depends on the unknown parameters, i.e., the probabilities θ and η. We propose allocating observations as if η 1 = η 1 , η 2 = η 2 and θ = η 1 η 2 is equal to its estimated value under the hypothesis of Bliss independence.
One simple approach to the minimization of (28) is the following greedy iterative procedure, which sequentially allocates observations into the condition where the variance is reduced most: ALGORITHM n 12 ← 1, n 1 ← 0, n 2 ← 0 if n = n 1 + n 2 + n 12 then stop else R 12 (n 12 ) ← 1 n 12 For example, if m 1 = 30, m 2 = 50, η 1 = 0.7 and η 2 = 0.8, and one had 56 observations, then 55 observations would be put in the arm where both treatments are administered, and only one would be used to improve the estimate ofη 1 . Table 3 contains a tabulation of the optimal allocation of (n 12 , n 1 , n 2 ). For selected combinations of the values of m 1 , m 2 , η 1 , η 2 the table gives the minimal value of n, denoted as n min , for which replications of the historic observations are needed, and then the optimal allocation for n min . As one would expect, when θ is closer to 0.5 than η 1 or η 2 , a larger sample size n 12 is allocated in the optimal design to estimating θ , than m 1 or m 2 . In the opposite case, n 12 is smaller than m 1 or m 2 .
As pointed out by a reviewer, the optimal design may not be unique. For example, in the first entry of the table, when η 1 = η 2 = 0.3, m + 1 − m 2 = 10, and one has 23 observations to allocate, 22 observations should be used for estimating joint effect θ and one observation should be used to improve the estimate of either of the individual effects η 1 or η 2 .

Summary and discussion
This paper focuses on the situation when historically available information in the form of parameter estimates are either incorporated in the analysis of a current study or used to plan a future experiment. We did not explicitly discuss the large literature on data combination schemes or other two-stage plug-in methods. As mentioned, when Table 3 The minimal sample size n min , at which optimal allocation requires to repeat historical observations, for selected values of the no-success probabilities η 1 , η 2 and historical sample sizes m 1 , m 2 in a test of drug interaction m 1 m 2 n min n 12 n 1 n 2 m 1 m 2 n min n 12 n 1 n 2 historical estimates are incorporated in an analysis their variability is rarely accounted for in the analysis. A partial list of examples, drawn from the scientific literature was furnished earlier and many more exist. However, it seems very difficult to find published research where the details are given to the extent which would make the replication of the analysis possible. This limits one's ability to apply the results of this paper to published research. However, the results presented here will inform future researchers of the scope and use of historical estimates and provide a toolkit for doing so. We hope that our investigation may have an effect on the quality of future analyses and publication standards. We also agree with a referee who has noted that an estimate is always a coarsening of the full data, and it is clear that having only access to a historical estimate instead of the entire data leads to less efficient inferences.
Different disciplines exhibit different modes of using historical estimates. Social scientists often incorporate estimates from surveys in the process of model fitting, whereas biologists and engineers may use parameters estimated in experiments which are very different than their own. One way, of course, of incorporating historical esti-mates is using prior distributions within the Bayesian framework. For recent examples see Hoff (2019) and Bryan and Hoff (2020). Our approach, however, is frequentist, as are most of the applications in the literature. In particular, we show how to incorporate historical estimates in a principled way in scenarios which we classify at Type I Problems, where the historical parameters are not re-estimated, and Type II Problems, where they are. Two variants of Type II Problems are described. See Theorems 3.1, 3.2 and 3.4. We also show that if, given the current data D, it is possible to re-estimate the historical parameters then it is beneficial to do so at least for large sample sizes (Theorem 3.3). Other preference relations, in fact a hierarchy, among the estimators and any function thereof, were also established, cf. Theorems 3.5, 3.6. The loss of precision in the above mentioned settings is quantified in terms of a decomposition of the variance matrices. It was also demonstrated that the availability of historical estimates should be taken into account when an optimal experiment is designed. In particular, relevant methods for a two-way ANOVA and for testing drug interaction were discussed. Thus the results of this paper go beyond the existing knowledge on the use of historical estimates.
In our analysis we have assumed that the current data D is a random sample and that the estimating Eq. (2) is of an additive form. These assumptions have been used merely to simplify the exposition and are easily modified to dependent data and various other estimating functions. It is clear that Type I and II Problems describe a broad range of possibilities, nevertheless they are insufficient for describing the rich collection of problems in which historical estimates may play a role. For example, our formulation assumes that the historical parameters η 1 , . . . , η K are distinct. However, in many situations this is not so. In fact, some of the historical studies may be full or partial replicates of each other. In cases when the current study is a partial replicate of a historical study, simple plug-in methods or re-estimation methods may be used. One has to be careful, though, about the choice of the estimates. We are aware of situations where a simple plug-in estimator performs better than a less than optimal re-estimating method. Throughout, we have assumed interchangeability. Clearly there are many experimental settings, especially in the sciences, where this assumption is realistic. In other situations, say clinical trials, heterogeneity rather than interchangeability is the rule. In such cases some modification of the methods proposed, using random effect models, may be possible. See Rukhin (2007) and the references therein.
Finally, it is also worth mentioning that the problem of accounting for historical estimates is naturally related, for obvious reasons, to sequential analysis, where data is collected over time, to meta-analysis, where the effort is to combine information from different sources and double sampling, and especially non-nested double sampling (Hidiroglou 2001), which attempts to provide better inferences by augmenting and predicting unobserved quantities from existing data sets. The literature on combining surveys (Kim and Rao 2012) is also relevant. Further understanding can be possibly attained by incorporating ideas from these fields. attention to the importance of Bliss independence and to the anonymous reviewers for their insightful and detailed comments.

Funding Open access funding provided by Eötvös Loránd University.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix: Proofs
Proof of Theorem 3.1: By assumption ψ is continuous and differentiable with respect to θ and η 1 , . . . , η K . Thus, so is Ψ . Hence, by the mean value theorem Applying the mean value theorem to Ψ (θ 0 , η) in the display above yields so (29) can be rewritten as . Now, by the continuous mapping theorem and the law of large numbers we have: which is a p × q matrix. For convenience we set D θ 0 = E 0 (∂ψ/∂θ ) and D η 0 = E 0 (∂ψ/∂η). Hence we can reexpress (29) more concisely as from which it follows, by the invertibility of D θ 0 , that Since the first term in the curly brackets above is a function of the data D collected in S and the second term depends on the historical data, i.e., the studies S 1 , . . . , S K the two terms are independent. Now, by the central limit theorem Collecting terms shows that as required, completing the proof.
for some function ϕ which is known as the influence function (cf. Van der Vaart 2000). For more details see Remark 9. It follows by the central limit theorem, that are asymptotically jointly multivariate normal, thus so are the first two terms in the curly brackets of (34). Moreover the third term, which depends on the historical data is independent of the first two terms and normally distributed. Now the covariance among the first two terms is However, by assumptionη is asymptotically unbiased and √ n-consistent, i.e., E 0 (η) Plugging the latter into the above display shows that covariance above converges to 0 as n → ∞. It now follows that all three terms appearing in (34) are asymptotically independent.

Proof of Theorem 3.3:
The following preliminary Lemma will be used.
Lemma 1 Let X 1 and X 2 be random vectors with variances V 1 = V(X 1 ) and V 2 = V(X 2 ) with V 1 V 2 . Then for any matrix A we have V( AX 1 ) V( AX 2 ). As a consequence we also have V −1

Proof of Lemma 1:
Proof Observe that for any vector u. The equalities above hold since products of symmetric matrices commute. The inequality V −1 1 V −1 2 follows immediately.
Next, an application of the δ-method and Theorems 3.1 and 3.2 shows that √ n(Φ(θ A ,η A ) − Φ(θ 0 , η 0 )) ⇒ N r (0, P AP T ) and √ n(Φ(θ B ,η B ) − Φ(θ 0 , η 0 )) ⇒ N r (0, P B P T ) where r is the dimension of Φ and P = E 0 (∂Φ/∂ω). Observe that P AP T is the variance of the random vector P T 1 whereas P B P T is the variance P T 2 . Since B A it follows from Lemma 1 that P B P T P AP T concluding the proof.

Proof of Theorem 3.5:
Proof Equation (42) and Lemma 1 imply that C Υ as stated. The fact that V Φ C V Φ Υ now follows as in Theorem 3.3.

Proof of Theorem 3.6:
Proof By Equation (33) Therefore comparing the estimatorsθ B andθ C amounts to comparing their influence functions implicit in (45) and (46), i.e., respectively. Also note that so although in principle it is possible to always compare the above influence functions in practice this comparison is very difficult unless some further simplifying assumptions are imposed.